Fast Author Name Disambiguation in CiteSeer

نویسندگان

  • Jian Huang
  • Seyda Ertekin
  • C. Lee Giles
چکیده

Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative machine learning framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters papers by author. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction time than a standard SVM. We prove that by recasting transitivity as density reachability in DBSCAN, transitivity is guaranteed for core points. For evaluation, we manually annotated 3,355 papers yielding 490 authors and achieved 90.6% pairwise-F1 metric. For scalability, authors in the entire CiteSeer dataset, over 700,000 papers, were readily disambiguated.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

بهبود صحت ابهام‌زدایی نام نویسنده با استفاده از خوشه‌بندی تجمّعی

Today, digital libraries are important academic resources including millions of citations and bibliographic essential information such as titles, author's names and location of publications. From the view of knowledge accumulation management, the ability to search fast, accurate, desired contents, has a great importance. The complexity and similarity in these resources cause many challenges and...

متن کامل

Efficient Name Disambiguation for Large-Scale Databases

Name disambiguation can occur when one is seeking a list of publications of an author who has used different name variations and when there are multiple other authors with the same name. We present an efficient integrative framework for solving the name disambiguation problem: a blocking method retrieves candidate classes of authors with similar names and a clustering method, DBSCAN, clusters p...

متن کامل

Extracting Citation Relationships from Web Documents for Author Disambiguation

Disambiguating the citation records of authors with the same name is a very interesting and challenging problem that affects many research and application fields, such as digital libraries. However, current bibliographic digital libraries like CiteSeer can not correctly disambiguate citation records because of two problems: information sparsity (citations for an individual have few or no common...

متن کامل

Scholarly big data information extraction and integration in the CiteSeerχ digital library

CiteSeer is a digital library that contains approximately 3.5 million scholarly documents and receives between 2 and 4 million requests per day. In addition to making documents available via a public Website, the data is also used to facilitate research in areas like citation analysis, co-author network analysis, scalability evaluation and information extraction. The papers in CiteSeer are gath...

متن کامل

ALIAS: Author Disambiguation in Microsoft Academic Search Engine Dataset

We present a system called ALIAS, that is designed to search for duplicate authors from Microsoft Academic Search Engine dataset. Author-ambiguity is a prevalent problem in this dataset, as many authors publish under several variations of their own name, or different authors share similar or same name. ALIAS takes an author name as an input (who may or may not exist in the corpus), and outputs ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2006